(54) Method and apparatus for reordering memory operations in a superscalar or very long
instruction word processor

(57) A method and apparatus for reordering memory operations in
superscalar or very long instruction word (VLIW) processors is
described, incorporating a mechanism that allows for arbitrary
distance between reading from memory and using data loaded
out-of-order, and that allows for moving load operations earlier
in the execution stream. This mechanism tolerates ambiguous
memory references. The mechanism executes only one additional
instruction for disambiguation purposes, thus producing good
performance, and integrates memory disambiguation with speculative
execution of instructions. The overhead introduced is only one
instruction, and the load operation can be arbitrarily moved
earlier in the instruction stream. The mechanism can cope with
conflicts that occur as a result of an unexpected combination of
store/load instructions, can be used in a coherent multiprocessor
context, and combines speculative execution with reordering of
memory operations in a way which requires simple hardware support.

Description



The present invention generally relates to reordering memory
operations in a superscalar or very long instruction word (VLIW)
processor in order to exploit instruction-level parallelism in
programs and, more particularly, to a method and apparatus for
reordering memory operations in spite of arbitrarily separated or
ambiguous memory references, thereby achieving a significant
improvement in the performance of the computer system. The method
and apparatus are applicable to uniprocessor and multiprocessor
systems.

High performance contemporary processors rely on superscalar
and/or very long instruction word (VLIW) techniques for exploiting
instruction level parallelism in programs; that is, for executing
more than one instruction at a time. These processors contain
multiple functional units, execute a sequential stream of
instructions, are able to fetch two or more instructions per cycle
from memory, and are able to dispatch two or more instructions per
cycle subject to dependencies and availability of resources. These
capabilities are exploited by compilers which generate code that
is optimized for superscalar and/or VLIW features.

In sequential programs, a memory load operation reads a datum from
memory, loads it in a processor register, and frequently starts a
sequence of operations that depend on the datum loaded. In a
superscalar or VLIW processor in which there are resources
available, it is advantageous to initiate memory load operations
as early as possible because that may lead to the use of otherwise
idle resources and may hide delays in accessing memory (including
potential cache misses), thus reducing the execution time of
programs. The load, as well as the operations that depend on the
load, are executed earlier than what they would have been in a
strictly sequential program, achieving a shorter execution time.
This requires the ability to perform non-blocking loads (i.e.,
continue issuing instructions beyond a load which produces a cache
miss), the ability to issue loads ahead of preceding stores (i.e.,
out-of-order loads), the ability to move loads ahead of preceding
branches (i.e., speculation), and the ability to move operations
that depend on a load ahead of other operations. In other words,
what is needed is the ability to reorder the operations of the
program.

Several factors limit the ability to perform reordering 
of memory operations, in particular factors arising from 
run-time dependencies in the execution of a program. 
These include moving operations ahead of conditional 
branch instructions and ambiguous memory references. 

Moving an operation ahead of a preceding conditional branch
instruction introduces speculation in the execution of a program,
because the operation is executed before it is known whether it
will be really required. The code motion is performed under the
expectation that the operation will be needed. Register-to-register
operations with no side-effects can be executed speculatively, as
long as the results are saved in unused ("dead") registers. If an
operation was not required, the result is just ignored. On the
other hand, register-to-register operations with side-effects and
memory load operations can be executed speculatively only if there
exist mechanisms to recover from side effects which should not
have been produced, such as exceptions (errors), protection
violations, or accesses to volatile memory locations.

Moving a memory load operation ahead of a preceding memory store
operation faces the problem of ambiguous references in the
execution of the program if it is not possible to determine at
compile time that the memory locations accessed by the load and
store are different. Unambiguous memory references can be executed
out-of-order because they do not conflict. On the other hand,
ambiguous memory operations can be executed out-of-order only if
there exist mechanisms to detect a potential conflict, ignore the
data loaded ahead of time, and reload the correct value after the
store operation has been performed. The conflict may be in a
single byte of a multiple byte operand, so the store operation
must be completed before the load operation can be performed.

Although the two problems described above are different, their
effects and requirements are the same. Namely, there must exist
mechanisms to detect and recover from the side effects or
ambiguities. In the following discussion, both of these problems
are referred to as "reordered memory accesses problems".

Contemporary compilation techniques include static memory
disambiguation algorithms for reordering memory operations. These
algorithms determine if two memory references, a memory store
operation followed by a memory load operation, access the same
location. If the references do not conflict (i.e., they address
different memory locations), then it is possible to reorder the
operations so that the load can be executed ahead of the store.
Static disambiguation works well only if the memory access pattern
is predictable. Frequently, that is not the case, and the
compiler/programmer must make the conservative assumption that the
references actually conflict so they must be executed sequentially
(in their original order), which reduces the potential
instruction-level parallelism in the program.

Reordering of memory operations has been a subject of active
interest. See, for example, the article by K. Diefendorff and
M. Allen entitled "Organization of the Motorola 88110 superscalar
RISC microprocessor", IEEE Micro, April 1992, pp. 40-63. The
dynamic scheduler in the Motorola 88110 processor dispatches store
instructions to a store queue where the store operations might
stall if the operand to be stored has not yet been produced by
another operation. Subsequent load instructions can bypass the
store and immediately access the memory, achieving dynamic
reordering of memory accesses. An address comparator detects
address hazards and prevents loads from going ahead of stores to
the same address. The queue holds three outstanding store
operations, so that this structure allows runtime overlapping of
tight loops. The structure does not really move a load earlier in
the sequential execution stream; instead, it only allows for a
load operation not to be delayed as a result of a stalled store
operation.

The static motion of load/store operations out from loops, under
certain conditions, was described by K. Ebcioglu, R. Groves,
K. Kim, G. Silberman, and I. Ziv in "VLIW compilation techniques
in a superscalar environment", SIGPLAN Conference on Programming
Language Design and Implementation (PLDI 94), 1994. This approach
is basically a generalization of the static movement of
loop-invariant instructions out of loops, with the additional
capability of moving loads and stores which are executed
conditionally if they are considered safe. The conditions required
for this optimization include guaranteeing that there is no
possibility for conflicting memory references (ambiguous memory
references), which is not always possible.

A compilation technique which allows scheduling of speculative
loads without modifying the architecture of the processor is
described by D. Bernstein, M. Rodeh and M. Hopkins in their patent
application entitled "Instruction scheduler for a computer",
Serial No. 08/364.S36 filed December 27, 1994, as a continuation
of application Serial No. 07/882,739 filed May 14, 1992, and
assigned to the assignee of this application. In this approach,
the suitability of a load operation for speculative execution is
determined by classifying it into a number of categories depending
on conditions applied to the base register used by the operation
and/or the contents of such a base register. Thus, as in the
techniques described by K. Ebcioglu et al., supra, this approach
is restricted to those cases that can be detected at compile time.

A hybrid memory disambiguation technique called "speculative
disambiguation" was proposed by A. Huang, G. Slavenburg, and
J. Shen in "Speculative disambiguation: a compilation technique
for dynamic memory disambiguation", 21st Intl. Symposium on
Computer Architecture, Chicago, pp. 200-210, 1994. This approach
uses a combination of hardware and compiler techniques to achieve
its objective. It performs transformations on the code to
anticipate either outcome of an ambiguous memory reference,
requiring guarded execution capabilities in the hardware. For each
pair of ambiguous memory references, the compiler creates two
versions of the code that depends on the memory reference. One
version assumes that the addresses overlap, whereas the other
version assumes they do not overlap. In both versions, operations
that do not have side effects are executed, while operations that
have side effects are guarded by the result of comparing the two
addresses. This approach requires more operations and resources
than the original program, in addition to capabilities for guarded
execution. It deals only with disambiguation and does not have
capabilities for moving load operations ahead of branches.



Another alternative to perform compiler optimization of program
execution by allowing load operations to be executed ahead of
store operations is described by A. Nicolau in "Run-time
disambiguation: coping with statically unpredictable
dependencies", IEEE Trans. on Computers, vol. 38, May 1989. This
approach relies on compiler identification of a load which can be
moved ahead of a store operation, and compiler insertion of the
necessary code, so that the processor can check at run-time if
there is a match among the address of the load and store
operations, as described by A. Huang et al., supra, but without
guarded-execution capabilities. If there is no match, the
processor executes a sequence of instructions in which the load
has been moved ahead of the store. On the other hand, if there is
a match, the processor executes a sequence of instructions in
which the load operation is performed after the store operation.
Since the check for the address match is performed by the
processor, this approach leads to potential performance
degradation due to the execution of more instructions and their
associated dependencies (e.g., the explicit generation of the
memory addresses and the address compare). Moreover, the reordered
load operation cannot be performed until the memory addresses for
both load and store operations have been resolved.
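
The run-time check just described can be pictured with a minimal
Python sketch. It is only an illustration under stated assumptions:
the function name, the dictionary used as memory, and the dependent
"add 20" operation are all hypothetical, and each access is reduced
to a single address rather than a byte range.

```python
def run_time_disambiguation(memory, store_addr, store_value, load_addr):
    """Illustrative model of compiler-inserted run-time disambiguation.

    An explicit address comparison selects one of two instruction
    sequences. Every name here is an assumption made for this sketch.
    """
    if store_addr != load_addr:
        # No match: sequence in which the load was moved ahead of the store.
        loaded = memory[load_addr]
        result = loaded + 20              # dependent operation starts early
        memory[store_addr] = store_value
    else:
        # Match: sequence that keeps the original order (store, then load).
        memory[store_addr] = store_value
        loaded = memory[load_addr]
        result = loaded + 20
    return result
```

Both addresses must be available before either sequence can start,
and the comparison itself is executed as ordinary instructions,
which is the overhead noted in the paragraph above.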

A method and apparatus for improving the performance of
out-of-order operations is described by M. Kumar, M. Ebcioglu, and
E. Kronstadt in their patent application entitled "A method and
apparatus for improving performance of out-of-sequence load
operations in a computer system", Serial No. 08/320,111 filed
October 7, 1994, as a continuation of application Serial
No. 07/880,102 filed May 6, 1992, and assigned to the assignee of
this application. This method and apparatus uses compiler
techniques, four new instructions, and an address compare unit.
The compiler statically moves memory load operations ahead of
memory store operations, marking all of them as out-of-order
instructions. The addresses of operands loaded out-of-order are
saved to an associative memory. On request, the address compare
unit compares the addresses saved in the associative memory with
the address generated by store operations. If a conflict is
detected, recovery code is executed to correct the problem. The
system clears addresses saved in the associative memory when there
is no longer a need to compare those addresses. This approach only
addresses the problem of reordering memory operations. It does not
include the ability to speculatively execute memory load
operations. Moreover, this approach requires special instructions
to trigger the checking for conflicts in addresses, as well as to
clear the address of an operand no longer needed, and imposes a
burden on the compiler which has to detect and pair all potential
conflicts. As a consequence, this approach cannot cope with
conflicts that occur as a result of an unexpected combination of
store/load instructions (perhaps produced by error), nor can it be
used in a coherent multiprocessor context.






EP0 742 512 A2 



As a related subject, a hardware mechanism coupled with compiler
support is described by G. Silberman and M. Ebcioglu in their
patent application entitled "Handling of exceptions in speculative
instructions", Serial No. 08/377,563 filed on January 24, 1995,
and assigned to the assignee of this application. This mechanism
reduces the overhead from exceptions originated by instructions
executed speculatively. The mechanism relies on hardware resources
such as an additional bit per register to indicate an exception
generated during the speculative execution of an instruction, two
additional register files to save the register operands so that
speculative instructions invalidated by an exception can be
re-executed, as well as information that allows tracing back to
the source of the exception. This mechanism is applicable only to
speculative instructions, not to reordered memory operations.

A method and apparatus for reordering load instructions is
described in the patent application entitled "Memory processor
that permits aggressive execution of load instructions" by
F. Amerson, R. Gupta, V. Kathail and M. Schlansker (UK Patent
Application GB 2265481 A, No. 9302148.3, filed on 04/02/1993).
This patent application describes a memory processor for a
computer system in which a compiler moves long-latency load
instructions earlier in the instruction sequence, to reduce the
loss of efficiency resulting from the latency of the load. The
memory processor saves load instructions in a special register
file for a period of time sufficient to determine if any
subsequent store instruction that would have been executed prior
to the load references the same address as that specified by the
load instruction. If so, the memory processor reinserts the
original load in the instruction stream so that it gets executed
in-order. Thus, this system permits moving loads ahead of stores
under compiler control, and relies on hardware to insert code to
recover from a conflict. However, this system does not permit
reordering other instructions that depend on the load (the
hardware resources are able to reinsert only the load
instruction), nor does it allow for speculative execution of loads
or other instructions. In other words, the method and apparatus is
limited to hiding the latency of load instructions, whose maximum
value must be known at compile time.

It is therefore an object of the present invention to provide a
mechanism that allows moving load instructions earlier in the
execution stream, allowing for arbitrary distance between reading
from memory and using data loaded out-of-order.

It is another object of the present invention to provide a
mechanism which is not limited to moving load operations out from
loops and which can tolerate ambiguous memory references.

It is a further object of the invention to provide a mechanism
that executes only one additional instruction for disambiguation
purposes, thus producing better performance, and which integrates
memory disambiguation with speculative execution.



It is yet another object of the invention to provide a mechanism
wherein the overhead introduced is only one instruction and the
load operation can be arbitrarily moved earlier in the instruction
stream.

It is still another object of the invention to provide a mechanism
which can cope with conflicts that occur as a result of an
unexpected combination of store/load instructions and which can be
used in a coherent multiprocessor context.

It is another object of the invention to provide a mechanism which
combines speculative execution with reordering of memory
operations in a way that requires rather simple implementations.

According to the invention, there is provided a method and
apparatus for reordering memory operations in a superscalar or
very long instruction word (VLIW) processor even for arbitrarily
separated and ambiguous memory references. This reordering reduces
the critical path length of programs by moving sequences of
dependent operations earlier in the program execution, thereby
increasing the performance of the computer system. The method and
apparatus integrates reordering memory operations with speculative
execution, and is applicable to uniprocessor and multiprocessor
systems. The apparatus consists of a multiple-entry address
comparator which checks for conflicts in addressing memory, a
status bit per comparator entry to indicate conflicts generated by
reordered memory operations, a status bit per register in the
register file of the processor to indicate pending exceptions,
special instructions to load a register out-of-order, copy a
register loaded out-of-order, and commit a register loaded
out-of-order, and compiler support to generate the code that uses
these resources as well as code to recover from exceptions arising
while executing an instruction out-of-order.

Out-of-order memory operations raise the following requirements:

exceptions (side effects) generated by out-of-order load
operations should not be reported (performed) until the data
loaded is used for an in-order (non-speculative) operation;

conflicts due to overlapping addresses when a load operation is
moved above a store operation must be detected;

data loaded ahead of a store must be checked for validity before
being used in-order (in other words, a check that the data has not
become stale due to an overlapping store); and

volatile locations cannot be loaded speculatively.

The approach taken in this invention relies on the following:

static reordering of code by the compiler to exploit
instruction-level parallelism;

hardware support to detect conflicts in ambiguous memory accesses,
report delayed exceptions, and manipulate data loaded
out-of-order; and

compiler generation of code for manipulating the data loaded
out-of-order and for the recovery from delayed exceptions.

The foregoing and other objects, aspects and advantages of this
invention will be better understood from the following detailed
description of a preferred embodiment of the invention with
reference to the drawings, in which:

Figures 1A and 1B, taken together, are a flow diagram showing the
logic for the execution of in-order and out-of-order operations
performed by the present invention; and

Figure 2 is a block diagram showing the hardware resources
supporting the execution of out-of-order load operations
illustrated by the flow diagram of Figure 1.

Referring now to the drawings, and more particularly to Figures 1A
and 1B, there is shown a flow diagram of the out-of-order load and
other operations performed by the present invention.

Every instruction issued by the processor, and every store
operation issued by another processor in a coherent multiprocessor
system, is decoded in function block 101. If the instruction is an
out-of-order load operation as determined in decision block 102,
and the instruction generates an exception as determined in
decision block 103, then the delayed exception (DX) bit associated
with the target register of the load instruction is set in
function block 104, but no exception is raised to the processor.
On the other hand, if the instruction does not generate an
exception, then the range of memory addresses referenced by the
instruction is saved in an entry in the address comparator (AC),
and the valid bit of this entry is set to valid in function block
105 (i.e., the address comparator entries act as a cache of memory
addresses recently loaded out-of-order).

If the instruction is a store operation as determined in decision
block 106, the range of memory addresses referenced by the
instruction is compared with all the entries in the address
comparator in decision block 107. For each entry matching this
range, the corresponding valid bit is set to invalid in function
block 108. Decision block 109 checks if all addresses have been
compared.

If the instruction is a commit operation as determined in decision
block 110, the valid bit in the address comparator entry
associated with the source register of the instruction is checked
in decision block 111. If the bit is set to false, a delayed
exception is generated in function block 112. At the same time,
the delayed exception bit of the source register of the commit
instruction is also checked in decision block 113. If this bit is
set, a delayed exception is generated in function block 114.

If the operation is any other operation as determined in decision
block 115, the delayed exception bit of all the source registers
of the instruction is checked in decision block 116. If any of
these bits is set, the delayed exception bit of the target
register of the instruction is set in function block 117, but no
exception is raised to the processor; otherwise, the delayed
exception bit of the target register of the instruction is set to
false in function block 118.

If a delayed exception is raised to the processor as determined in
decision block 119, the excepting instruction is aborted and
execution control is transferred to an exception handler in
function block 120. This exception handler is in charge of
executing "recovery code" which repeats the load operation which
generated the exception, as well as any operation that depends on
the load and which was executed before the exception was raised.
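
The decision blocks described above can be summarized in a small
behavioral model. The Python sketch below is an illustration only,
not the claimed hardware: the class and method names, the use of
dictionaries for registers and memory, the `faulting` set standing
in for the storage protection mechanism, and the reduction of each
address range to a single address are all assumptions. Recovery
code and the handling of AC entries for derived values are left
out.

```python
class DelayedException(Exception):
    """Signals that recovery code must re-execute the load and its dependents."""


class Figure1Model:
    """Toy model of the Figure 1 flow. Every name here is an assumption."""

    def __init__(self, memory, faulting=()):
        self.memory = dict(memory)     # address -> value
        self.faulting = set(faulting)  # addresses whose access raises an exception
        self.regs = {}                 # register -> value
        self.dx = {}                   # register -> delayed exception (DX) bit
        self.ac = {}                   # register -> [address, valid bit]

    def load_out_of_order(self, target, address):
        if address in self.faulting:
            self.dx[target] = True     # block 104: set DX, raise nothing
            return
        self.regs[target] = self.memory[address]
        self.dx[target] = False        # assumption: a clean load clears DX
        self.ac[target] = [address, True]   # block 105: save address, mark valid

    def store(self, address, value):
        self.memory[address] = value
        for entry in self.ac.values():      # blocks 107-109: compare all entries
            if entry[0] == address:
                entry[1] = False            # block 108: mark invalid

    def commit(self, target, source):
        entry = self.ac.get(source)
        stale = entry is None or not entry[1]      # block 111
        if stale or self.dx.get(source, False):    # block 113
            raise DelayedException(source)         # blocks 112 and 114
        self.regs[target] = self.regs[source]

    def other(self, target, sources, operation):
        if any(self.dx.get(s, False) for s in sources):
            self.dx[target] = True     # block 117: propagate, raise nothing
            return
        self.regs[target] = operation(*(self.regs[s] for s in sources))
        self.dx[target] = False        # block 118
```

In this model a `commit` that follows an intervening `store` to the
loaded address raises `DelayedException`, while a commit with no
intervening conflict simply copies the value. A store received from
another processor would be presented to the model through the same
`store` method.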

From the flow diagram shown in Figure 1, the following features of
this invention are inferred:

Reporting Exceptions: Errors (side effects) arising during the
out-of-order execution of a load operation (such as protection
violations) are not reported until the data loaded out-of-order is
needed for an in-order operation, at the original place of the
load instruction in a sequential instruction stream. If there were
errors, the load instruction is re-executed at that point as well
as any other instruction(s) already executed that depend on the
load which was executed out-of-order.

To execute out-of-order load instructions that might raise
exceptions, target registers are tagged with a "Delayed Exception"
bit. This bit is used to report, in a delayed manner, the
exceptions that occurred during execution of an out-of-order load
instruction. The delayed exception bit is set when the
out-of-order load generates an exception, and is propagated when a
register is used in other operations. A commit operation checks
the delayed exception bit of its operand. If the delayed exception
bit is set, then a delayed exception is generated. The exception
handler is in charge of re-executing the load instruction that
raised the exception being reported in a delayed manner, as well
as any other instruction(s) already executed that depended on it.

Conflicts Due to Overlapping Storage Addresses: Load operations
moved ahead of store operations are subject to conflicts due to
overlapping memory addresses. Conflicts may also arise due to
store operations performed by other processors in a coherent
multiprocessor environment. In both cases, an out-of-order load
might access data that becomes stale as a consequence of a store
operation to the same memory location (by the same or another
processor) in the interval from the execution of the out-of-order
load to the first in-order use of the data loaded.

Conflicts arising from overlapping load/store addresses are
detected dynamically by using a multiple-entry "Address
Comparator" (AC). When executing an out-of-order load operation,
the range of real addresses of the operand loaded is saved in an
address comparator entry.

When a store operation is executed, the range of 
real addresses of the store operand is compared 
against the contents of all entries in the address compa- 
rator. Similarly, in a coherent multiprocessor system, the 
address comparator entries are also checked when 
store references are received from other processors in 
the system. For each address comparator entry, if an 
overlap in the real address is detected, the entry is 
marked invalid. 
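
Because a conflict may involve only one byte of a multiple-byte
operand, the comparison is between address ranges rather than
single addresses. The Python sketch below illustrates such a range
check under stated assumptions: the function names and the
dictionary layout of an AC entry ('first', 'last', 'valid') are
hypothetical, and ranges are taken as inclusive byte addresses.

```python
def ranges_overlap(first_a, last_a, first_b, last_b):
    """Return True if two inclusive address ranges share at least one byte."""
    return first_a <= last_b and first_b <= last_a


def invalidate_on_store(ac_entries, store_addr, store_size):
    """Mark every AC entry that overlaps the stored bytes as invalid.

    ac_entries maps a register name to a dict with 'first', 'last' and
    'valid' keys; this layout is an assumption made for the sketch.
    """
    store_last = store_addr + store_size - 1
    for entry in ac_entries.values():
        if ranges_overlap(entry["first"], entry["last"], store_addr, store_last):
            entry["valid"] = False
    return ac_entries
```

For example, with a four-byte operand recorded at addresses 0x100
to 0x103, a one-byte store to 0x103 invalidates the entry, whereas
a store to 0x104 leaves it valid.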

Committing Data Loaded Out-of-Order: An operand loaded
out-of-order, as well as any value derived from it, must be
"committed" before it can be used as an operand in a
non-speculative (in-order) instruction. That is, the program must
verify that the address comparator entry associated with a
particular register is still valid at the point where the data is
used (usually, at the original place in the program). A special
instruction is used for these purposes, which optionally copies
the data loaded out-of-order into another register and at the same
time verifies the validity of the associated address comparator
entry. If the entry is valid, the commit (copy) operation
proceeds. On the other hand, if the entry is invalid, a delayed
exception is raised. The exception handler re-executes that load
operation and any other operation(s) already executed that depend
on the load.

Volatile Loads: Loads from volatile locations are not executed
out-of-order. Any attempt to load out-of-order from a volatile
location (detected by the storage protection mechanism) simply
sets the delayed exception bit of the associated register. No
storage access is performed.

Hardware Implementation 



A processor provided with hardware resources to support
out-of-order load and other operations as described above is shown
in Figure 2. The processor includes a plurality of functional
units, such as fixed and floating point arithmetic logic units
(ALUs). Six functional units 201 to 206 are shown, but those
skilled in the art will understand that there may be more or fewer
functional units, depending on a specific processor design. These
functional units access data from a data cache 207, and in the
case of a multiprocessor system, the data cache is connected to
other processors. The functional units are connected to general
purpose registers (GPRs) 208, floating point registers (FPRs) 209,
and condition registers 210, as appropriate to the functional
unit.

The structure described thus far is conventional and well
understood in the art. The invention adds address comparator (AC)
buffers 211 and delayed exception (DX) bits 212, 213 and 214
respectively to GPRs 208, FPRs 209 and condition registers 210.
More particularly, a delayed exception bit is associated with each
register which can be the target of an operation executed
out-of-order. Delayed exception bits are accessible as special
purpose registers, and they are saved and restored as part of the
state of the processor at context switching.

An address comparator 211 entry is associated with each register
which is loaded out-of-order. Figure 2 depicts a static
association, in which each register has a unique (fixed)
associated entry. (Alternatively, address comparator entries can
also be assigned dynamically, at execution time, to each register
that requires it.) In this embodiment of the invention, each AC
entry consists of (1) the range of real addresses of the operand
loaded out-of-order; (2) a valid bit indicating if any byte of the
operand loaded out-of-order has been modified by a subsequent
store operation, either from the same processor or from another
processor in a coherent multiprocessor system; and (3) a
comparator, which checks for a match among the range of addresses
covered by the AC entry and the range of addresses of each store
operation.

In the embodiment shown in Figure 2, if an implementation contains
fewer AC entries than the number of registers which can be loaded
out-of-order, then registers without an AC entry have only the
associated valid bit which is permanently set to invalid. In this
way, any access to a nonexisting AC entry will report an invalid
entry and generate a delayed exception.
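
A minimal Python sketch of this static association with a partial
address comparator is given below. It is an assumption-laden
illustration: the class name, the method names, and the use of a
tuple for the saved range are hypothetical and do not come from the
embodiment.

```python
class StaticAddressComparator:
    """Static register-to-entry association with a partial AC.

    Registers listed in `backed` own a real entry; every other register
    behaves as if its valid bit were permanently invalid. All names are
    assumptions made for this sketch.
    """

    def __init__(self, backed):
        # None means the entry currently holds no valid range.
        self.entries = {reg: None for reg in backed}

    def record_load(self, reg, first, last):
        if reg in self.entries:
            self.entries[reg] = (first, last)
        # Registers without an entry have nothing to record.

    def invalidate(self, reg):
        if reg in self.entries:
            self.entries[reg] = None

    def is_valid(self, reg):
        # A register with no entry always reports invalid.
        return self.entries.get(reg) is not None
```

Under this reading, a commit on a register that has no AC entry
would always take the delayed exception path, so the recovery code
repeats the load in order; the program would presumably still
produce the expected values, only without the benefit of the
reordering.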

Out-of-order instructions use the DX bit of the register operands
as follows. If the instruction generates an exception, only the DX
bit of the target register is set but the exception is not raised.
If the DX bit of any operand used by the instruction is set (which
indicates that a delayed exception was already generated), the DX
bit of the target register is also set; that is, the delayed
exception is propagated through the out-of-order operations.

In addition to the resources described above, the invention
includes the following instructions which are executed by the
processor:



45 Load Register Out-of-order • This instruction loads 
a memory location into a register and stores the 
range of real addresses of the operand in con-e- 
^jonding AC entry, which is marked valid. In prac- 
tice, lead instructions are extended by adding one 

50 bit to indicate the in-order/out-of -order nature of the 
instruction. 

Move Register Out-of-order - This instruction cop- 
ies tiie contents of tiie source register and its asse- 
ss dated AC entry into the target register and 
associated target AC entry. Due to its functionality, 
this is always an out-of-order Instruction. The 
delayed exception (DX) bit of the source register is 
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copied into the delayed exception bit of the target 
register, but no exceptions are raised. 

Commit Register - This instruction is executed only
in-order. It copies the contents of a register loaded
out-of-order into another register, checks that the
address comparator entry associated with the
source register has not been invalidated by a store
to an overlapping address, and checks that the
delayed exception bit of the source register is not
set. If the address comparator entry is valid and the
delayed exception (DX) bit is not set, the move register
operation proceeds. If the AC entry is not valid
or the DX bit is set, a delayed exception is generated.

Invalidate Address Comparator Entries - This
instruction invalidates the contents of the entire
address comparator. It is used by the system software
if the AC is not saved when context is
switched, to avoid having an AC entry from the old
context generate an exception in the new context.
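
The instructions listed above, together with the DX propagation rule and the invalidation performed by stores, can be summarised in a short behavioural sketch. This is a toy Python model under stated assumptions, not the hardware design: the `Machine` class, its method names, the dictionary-based memory, and the use of a missing address to stand for a faulting load are all hypothetical.

```python
from dataclasses import dataclass, field


class DelayedException(Exception):
    """Raised only by commit; stands in for the trap to recovery code."""


@dataclass
class Machine:
    mem: dict = field(default_factory=dict)  # address -> value (toy memory)
    reg: dict = field(default_factory=dict)  # register name -> value
    dx: dict = field(default_factory=dict)   # register name -> DX bit
    ac: dict = field(default_factory=dict)   # register name -> (lo, hi, valid)

    def load_ooo(self, rt, addr, size=4):
        """Load Register Out-of-order."""
        if addr not in self.mem:       # assumption: models a faulting load
            self.dx[rt] = True         # record the exception, do not raise it
            return
        self.reg[rt] = self.mem[addr]
        self.dx[rt] = False
        self.ac[rt] = (addr, addr + size - 1, True)

    def op_ooo(self, rt, fn, *srcs):
        """Any other out-of-order operation: DX propagates to the target."""
        self.dx[rt] = any(self.dx.get(s, False) for s in srcs)
        if self.dx[rt]:
            return
        try:
            self.reg[rt] = fn(*(self.reg[s] for s in srcs))
        except ArithmeticError:
            self.dx[rt] = True         # exception is delayed, not raised

    def move_ooo(self, rt, rs):
        """Move Register Out-of-order: copy value, AC entry and DX bit."""
        self.reg[rt] = self.reg.get(rs)
        self.dx[rt] = self.dx.get(rs, False)
        if rs in self.ac:
            self.ac[rt] = self.ac[rs]

    def store(self, rs, addr, size=4):
        """In-order store: invalidate every overlapping AC entry."""
        self.mem[addr] = self.reg[rs]
        lo, hi = addr, addr + size - 1
        for r, (elo, ehi, valid) in list(self.ac.items()):
            if valid and lo <= ehi and elo <= hi:
                self.ac[r] = (elo, ehi, False)

    def commit(self, rt, rs):
        """Commit Register: check AC entry and DX bit of the source."""
        _, _, valid = self.ac.get(rs, (0, 0, False))
        if not valid or self.dx.get(rs, False):
            raise DelayedException(rs)
        self.reg[rt] = self.reg[rs]

    def invalidate_ac(self):
        """Invalidate Address Comparator Entries."""
        self.ac = {r: (lo, hi, False) for r, (lo, hi, _) in self.ac.items()}


m = Machine(mem={110: 7}, reg={"r4": 100, "r3": 9, "r2": 200})
m.load_ooo("r25", m.reg["r4"] + 10)   # load? r25,10(r4)
m.store("r3", m.reg["r2"] + 20)       # store r3,20(r2), no overlap here
m.commit("r5", "r25")                 # commit r5,r25 succeeds
print(m.reg["r5"])
```

Keeping the valid bit and the DX bit as separate per-register state mirrors the text: a store can only clear the valid bit, an excepting out-of-order operation can only set the DX bit, and commit is the single point where either condition becomes visible.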

Example of Speculation of Load Operations

The use of the resources described above is now
illustrated by way of example. Consider the (original)
code shown below in the left hand column, which contains
a load instruction and some arithmetic instructions
that depend on the load below a store instruction. In this
example, the first register after the name of the instruction
is the target register, whereas the remaining registers
are the operands.



Original Code            Load Moved Above Store

                         load?  r25,10(r4)
store  r3,20(r2)         store  r3,20(r2)
load   r5,10(r4)         commit r5,r25
add    r6,r5,20          add    r6,r5,20
sub    r7,r6,r7          sub    r7,r6,r7

The load instruction is moved above the store, as
depicted in the right hand column. In this example, it is
assumed that the target register is renamed. The new
target register is loaded out-of-order, as indicated by the
question mark in the load opcode in the right hand column.
As a consequence, the data is loaded into the target
register while the range of addresses of the loaded
operand is saved in an AC entry, which is marked valid.
The original load instruction is replaced by a commit
instruction, as shown in the right hand column.

Assume that the arithmetic instructions that follow
the load are also reordered as shown below:

Previous Code            Operations Moved Above
                         Commit Instruction

load?  r25,10(r4)        load?  r25,10(r4)
                         add?   r26,r25,20
                         sub?   r27,r26,r7
store  r3,20(r2)         store  r3,20(r2)
commit r5,r25            commit r5,r25
add    r6,r5,20          copy   r6,r26
sub    r7,r6,r7          copy   r7,r27



Furthermore, assume that these instructions cannot
raise exceptions. In this case, the commit operation
is followed by copy register operations which copy the
results from the out-of-order operations to their destination
registers, if necessary. Such copy operations can
be removed by copy-propagation steps during code
optimization performed by the compiler. In contrast to
the commit out-of-order register instruction, the copy
register operations just copy the source register into the
destination register, without checking the address comparator,
because the corresponding operands are
implicitly either validated or invalidated by the preceding
commit operation.

If the store operation overlaps the location loaded
out-of-order, then the corresponding AC entry is marked
invalid as a side effect of the store operation. Consequently,
the execution of the commit instruction raises a
delayed exception. The handler associated with this
exception must contain recovery code which executes
the load operation as well as the two operations that
depend on the load, which were executed before the
exception was raised. For this purpose, the operands of
the instructions that are re-executed in the recovery
code must still be available, either in the same registers
or in other locations.

As an example, consider the recovery code below
for the reordered code shown above:






Reordered Code                   Recovery Code

load?  r25,10(r4)
add?   r26,r25,20
sub?   r27,r26,r7
store  r3,20(r2)
                         rcvr:   load   r25,10(r4)
commit r5,r25                    add    r26,r25,20
copy   r6,r26                    sub    r27,r26,r7
copy   r7,r27                    return

Note that every instruction which depends on the
out-of-order load is re-executed as part of the recovery
code, which then returns to re-execute the commit operation.
Alternatively, as a further optimization, whenever
possible the recovery code can directly update the original
target registers and skip the copy register operations
which were not removed by copy-propagation
optimizations.
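
The recovery flow can be traced with a small Python sketch of the reordered sequence. It is a simplified illustration, not the patented mechanism itself: registers and memory are plain dictionaries, each operand occupies a single address (so overlap reduces to address equality), and the function names are hypothetical. It also assumes that the load re-executed in the recovery code makes the AC entry valid again so that the re-executed commit succeeds, a detail the text does not spell out.

```python
class DelayedException(Exception):
    """Stands in for the delayed exception raised by commit."""


def commit(reg, ac_valid, rt, rs):
    # commit rt,rs: check the AC entry of the source, then copy.
    if not ac_valid.get(rs, False):
        raise DelayedException(rs)
    reg[rt] = reg[rs]


def recovery(mem, reg, ac_valid):
    # Re-execute the load and every operation that depends on it.
    reg["r25"] = mem.get(reg["r4"] + 10, 0)   # load r25,10(r4)
    ac_valid["r25"] = True                    # assumption, see lead-in
    reg["r26"] = reg["r25"] + 20              # add  r26,r25,20
    reg["r27"] = reg["r26"] - reg["r7"]       # sub  r27,r26,r7


def run(mem, reg):
    """Run the reordered code; return True if recovery code was executed."""
    ac_valid, ac_addr = {}, {}

    ac_addr["r25"] = reg["r4"] + 10           # load? r25,10(r4)
    reg["r25"] = mem.get(ac_addr["r25"], 0)
    ac_valid["r25"] = True
    reg["r26"] = reg["r25"] + 20              # add?  r26,r25,20
    reg["r27"] = reg["r26"] - reg["r7"]       # sub?  r27,r26,r7

    st_addr = reg["r2"] + 20                  # store r3,20(r2)
    mem[st_addr] = reg["r3"]
    for r, addr in ac_addr.items():
        if addr == st_addr:                   # store hits a saved address
            ac_valid[r] = False

    recovered = False
    while True:
        try:
            commit(reg, ac_valid, "r5", "r25")   # commit r5,r25
            break
        except DelayedException:
            recovery(mem, reg, ac_valid)         # handler, then retry commit
            recovered = True

    reg["r6"] = reg["r26"]                    # copy r6,r26
    reg["r7"] = reg["r27"]                    # copy r7,r27
    return recovered


# Case 1: store to address 320 does not touch the loaded address 110.
regs_ok = {"r2": 300, "r3": 42, "r4": 100, "r7": 3}
took_recovery_ok = run({110: 5}, regs_ok)

# Case 2: store to address 110 aliases the out-of-order load.
regs_alias = {"r2": 90, "r3": 42, "r4": 100, "r7": 3}
took_recovery_alias = run({110: 5}, regs_alias)

print(took_recovery_ok, regs_ok["r5"], regs_ok["r6"], regs_ok["r7"])
print(took_recovery_alias, regs_alias["r5"], regs_alias["r6"], regs_alias["r7"])
```

In the aliasing case the speculative results in r25, r26 and r27 are stale, and the recovery code can recompute them only because r4 and r7 still hold their original values when the handler runs. This is the requirement stated above that the operands of re-executed instructions must remain available.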

Claims



1. A method of reordering memory operations in
superscalar or very long instruction word (VLIW)
processors even for arbitrarily separated and
ambiguous memory references comprising the
steps of:

decoding instructions issued by a processor;

determining if a decoded instruction is an out-of-order
load instruction and, if so, determining if the
out-of-order load instruction generates an exception;

setting a delayed exception bit associated with a
target register of the load instruction for an out-of-order
load instruction which generates an exception;

saving a memory address of an out-of-order load
instruction which does not generate an exception in
an address comparator and setting a valid bit for the
saved memory address in the address comparator;

determining if a decoded instruction is a store operation;

comparing a range of memory addresses referenced
by a decoded store instruction with all
entries in the address comparator;

for each match of an entry in the address comparator,
setting the valid bit of the corresponding entry to
invalid;

determining if a decoded instruction is a commit
operation;

checking the valid bit of the address comparator
entry associated with a target register of the
decoded commit operation and generating a
delayed exception if the valid bit is set to invalid
and, at the same time, checking the delayed exception
bit of a source register of the commit operation
and, if the delayed exception bit is set, generating a
delayed exception; and

aborting an excepting instruction and transferring
control to an exception handler.

2. The method of reordering memory operations in
superscalar or very long instruction word (VLIW)
processors recited in claim 1 further comprising the
steps of:

determining if the decoded instruction is other than
an out-of-order load, a store or a commit instruction;
and

checking the delayed exception bit for all source
registers used by the instruction and, if any delayed
exception bits are set, setting corresponding
delayed exception bits in target registers.

3. A superscalar or very long instruction word (VLIW)
processor capable of reordering memory operations
even for arbitrarily separated and ambiguous
memory references comprising:

a decoder for decoding instructions issued by the
processor;

a plurality of registers each having delayed exception
bits accessible as special purpose registers;

functional means for determining if a decoded
instruction is an out-of-order load instruction and, if
so, determining if the out-of-order load instruction
generates an exception, said functional means setting
a delayed exception bit associated with a target
register of the load instruction for an out-of-order
load instruction which generates an exception;

an address comparator for saving a memory
address of an out-of-order load instruction which
does not generate an exception, said address comparator
having a valid bit which is set for the saved
memory address;

said functional means determining if a decoded
instruction is a store operation;

said address comparator comparing a range of
memory addresses referenced by a decoded store
instruction with all entries in the address comparator
and, for each match of an entry in the address
comparator, setting the valid bit of the corresponding
entry to invalid;

said functional means determining if a decoded
instruction is a commit operation, checking the valid
bit of the address comparator entry associated with
a target register of the decoded commit operation
and generating a delayed exception if the valid bit is
set to invalid and, at the same time, checking the
delayed exception bit of a source register of the
commit operation and, if the delayed exception bit is
set, generating a delayed exception and aborting
the excepting instruction; and

an exception handler executing recovery code
when an excepting instruction is aborted.
[FIG. 1A: flow diagram. Recoverable box labels: decode operation
issued by this processor, or decode store operation issued by
another processor; save range of memory addresses in associated
address comparator entry and set valid bit for address comparator
entry to true; set DX bit to true in target register of load; set
valid bit of corresponding entry to false; generate delayed
exception.]

[FIG. 1B: flow diagram, continued. Recoverable box labels: set DX
bit of target register of instruction to true; branch to execute
recovery code; end.]